This worksheet (HDT) uses the svm model from the e1071 library, along with several other functions that require the data to be transformed first. The first step is to generate a variable that classifies each house as economical, intermediate, or expensive, as was done in previous worksheets. Because extreme high outliers considerably affect a prediction, they are not assigned to any of the three categories, which produces NA values in our variable tipoDeCasa. To solve this, we remove the rows that have NA in tipoDeCasa. It is also important that the quantitative variables contain no NA values, so we replace them with 0, taking into account what each variable represents.
## 'data.frame': 1436 obs. of 82 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : num 65 80 68 60 84 85 75 0 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 3 levels "0","Grvl","Pave": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 5 levels "0","BrkCmn","BrkFace",..: 3 4 3 4 3 4 5 5 4 4 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 5 levels "0","Ex","Fa",..: 4 4 4 5 4 4 2 4 5 5 ...
## $ BsmtCond : Factor w/ 5 levels "0","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ BsmtExposure : Factor w/ 5 levels "0","Av","Gd",..: 5 3 4 5 2 5 2 4 5 5 ...
## $ BsmtFinType1 : Factor w/ 7 levels "0","ALQ","BLQ",..: 4 2 4 2 4 4 4 2 7 4 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : Factor w/ 7 levels "0","ALQ","BLQ",..: 7 7 7 7 7 7 7 3 7 7 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "0","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 3 6 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 6 levels "0","Ex","Fa",..: 1 6 6 4 6 1 4 6 6 6 ...
## $ GarageType : Factor w/ 7 levels "0","2Types","Attchd",..: 3 3 3 7 3 3 3 3 7 3 ...
## $ GarageYrBlt : num 2003 1976 2001 1998 2000 ...
## $ GarageFinish : Factor w/ 4 levels "0","Fin","RFn",..: 3 3 3 4 3 4 3 3 4 3 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 6 levels "0","Ex","Fa",..: 6 6 6 6 6 6 6 6 3 4 ...
## $ GarageCond : Factor w/ 6 levels "0","Ex","Fa",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 4 levels "0","Ex","Fa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Fence : Factor w/ 5 levels "0","GdPrv","GdWo",..: 1 1 1 1 1 4 1 1 1 1 ...
## $ MiscFeature : Factor w/ 5 levels "0","Gar2","Othr",..: 1 1 1 1 1 4 1 4 1 1 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## $ tipoDeCasa : num 3 2 3 1 3 1 3 2 1 1 ...
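The preprocessing described at the start of this section can be sketched roughly as follows. This is a minimal sketch on a toy data frame, not the original code: the data frame name `houses` and the price cut-offs in `breaks` are hypothetical, since the report does not state the actual thresholds. Prices above the last break fall outside every interval, so `cut()` returns NA for them and those rows are then dropped.

```r
# Toy data standing in for the housing data set (values are illustrative only).
houses <- data.frame(
  SalePrice   = c(120000, 180000, 250000, 755000, 140000),
  LotFrontage = c(65, NA, 68, 60, NA),
  MasVnrArea  = c(196, 0, NA, 0, 350)
)

# Hypothetical cut-offs: the report does not state the real thresholds.
breaks <- c(0, 150000, 200000, 400000)

# cut() yields NA for prices outside the breaks, so extreme outliers get no category.
# Codes: 1 = economical, 2 = intermediate, 3 = expensive.
houses$tipoDeCasa <- as.numeric(cut(houses$SalePrice, breaks = breaks, labels = FALSE))

# Drop the rows that could not be categorised.
houses <- houses[!is.na(houses$tipoDeCasa), ]

# Replace the remaining NA values in numeric columns with 0.
num_cols <- sapply(houses, is.numeric)
houses[num_cols] <- lapply(houses[num_cols], function(x) {
  x[is.na(x)] <- 0
  x
})

str(houses)
```

Replacing NA with 0 is only reasonable where 0 is a meaningful value for the variable (for example, an area of 0 when the feature is absent), which is the caveat the text makes about considering what each variable represents.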
Next we run a correlation analysis to discard quantitative variables that are highly correlated with each other, which simplifies the modeling phase. Because there are many columns, the data set was split into two subsets. The first matrix corresponds to conjunto1, the second to conjunto2, and the third is the correlation matrix between conjunto1 and conjunto2. Finally, a comparison table makes it easier to see the correlation between tipoDeCasa and every other variable.
## [,1]
## LotFrontage 0.124781325
## LotArea 0.219555016
## OverallQual 0.738892019
## OverallCond -0.095483076
## YearBuilt 0.565329689
## YearRemodAdd 0.544774124
## MasVnrArea 0.344254632
## BsmtFinSF1 0.276950433
## BsmtFinSF2 0.012463641
## BsmtUnfSF 0.208466092
## TotalBsmtSF 0.513208477
## X1stFlrSF 0.496997346
## X2ndFlrSF 0.297500739
## LowQualFinSF -0.068654379
## GrLivArea 0.622251690
## BsmtFullBath 0.189408729
## BsmtHalfBath -0.035329884
## FullBath 0.574967912
## HalfBath 0.307309514
## BedroomAbvGr 0.157624456
## KitchenAbvGr -0.167361574
## TotRmsAbvGrd 0.441616356
## Fireplaces 0.454122445
## GarageYrBlt 0.262710019
## GarageCars 0.610605411
## GarageArea 0.575708954
## WoodDeckSF 0.281222474
## OpenPorchSF 0.326152417
## EnclosedPorch -0.138763225
## X3SsnPorch 0.059689959
## ScreenPorch 0.099859557
## PoolArea 0.051525222
## MiscVal -0.001375927
## MoSold 0.070906122
## YrSold -0.012373046
## tipoDeCasa 1.000000000
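A one-column table like the one above can be produced with base R's `cor()`. The sketch below assumes a small made-up data frame called `houses`; the column names mirror the report, but the values are illustrative and the original code may have been written differently.

```r
# Toy numeric data; names follow the report, values are made up.
houses <- data.frame(
  OverallQual = c(7, 6, 7, 5, 8, 5),
  GrLivArea   = c(1710, 1262, 1786, 1077, 2198, 1362),
  MoSold      = c(2, 5, 9, 1, 12, 10),
  tipoDeCasa  = c(3, 2, 3, 1, 3, 1)
)

# Keep numeric columns only and correlate each one with the target.
numeric_cols <- houses[sapply(houses, is.numeric)]
cor_target   <- cor(numeric_cols, numeric_cols$tipoDeCasa)

# The result is a one-column matrix; sort it by absolute correlation, strongest first.
cor_target[order(-abs(cor_target[, 1])), , drop = FALSE]
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which makes the comparison table easier to scan.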
Based on the correlation plot and the correlation table, the variables with the highest correlation with tipoDeCasa are:
We also know that the variables with a mutual correlation greater than 0.6 are:
Next, we check the correlation of each of these variables with tipoDeCasa and, within each group of mutually correlated variables, remove the ones with the lowest correlation. We therefore drop the following variables:
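The selection rule just described (find predictor pairs with mutual correlation above 0.6, then drop whichever member is less correlated with tipoDeCasa) can be sketched as below. The data frame `houses`, its values, and the helper names `threshold` and `to_drop` are assumptions made for illustration; the report does not show how the rule was actually coded.

```r
# Toy data: GarageArea is built to track GarageCars closely (values are made up).
houses <- data.frame(
  GarageCars = c(1, 2, 3, 4, 5, 6, 7, 8),
  GarageArea = c(2, 1, 4, 3, 6, 5, 8, 7),
  MoSold     = c(12, 1, 12, 1, 12, 1, 12, 1),
  tipoDeCasa = c(1, 1, 1, 2, 2, 3, 3, 3)
)

threshold  <- 0.6
predictors <- setdiff(names(houses), "tipoDeCasa")

# Absolute correlation between predictors, and between each predictor and the target.
cor_mutual <- abs(cor(houses[predictors]))
cor_target <- abs(cor(houses[predictors], houses$tipoDeCasa))[, 1]

to_drop <- character(0)
for (i in seq_along(predictors)) {
  for (j in seq_len(i - 1)) {
    a <- predictors[i]
    b <- predictors[j]
    # Skip pairs where one member has already been discarded.
    if (a %in% to_drop || b %in% to_drop) next
    if (cor_mutual[a, b] > threshold) {
      # Discard the member of the pair that explains the target less well.
      worse   <- if (cor_target[a] < cor_target[b]) a else b
      to_drop <- c(to_drop, worse)
    }
  }
}

reduced <- houses[setdiff(names(houses), to_drop)]
to_drop
```

In this toy data GarageCars and GarageArea have a mutual correlation of about 0.90, and GarageArea is the one less correlated with the target, so it is the column that gets dropped.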
One of the main requirements is that every factor has at least 2 levels before it can be passed to the svm function. In total, nine models are built: 3 linear, 3 radial, and 3 polynomial. Parameters such as cost, gamma, degree, and coef0 were varied to obtain different combinations and therefore different results, and for each combination we looked for values that gave higher prediction accuracy.
##
## Call:
## svm(formula = tipoDeCasa ~ ., data = train, type = "C-classification",
## cost = 2^5, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 32
##
## Number of Support Vectors: 337
##
## ( 91 153 93 )
##
##
## Number of Classes: 3
##
## Levels:
## 1 2 3
##
## Call:
## svm(formula = tipoDeCasa ~ ., data = train, type = "C-classification",
## gamma = 0.005, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 641
##
## ( 164 305 172 )
##
##
## Number of Classes: 3
##
## Levels:
## 1 2 3
##
## Call:
## svm(formula = tipoDeCasa ~ ., data = train, type = "C-classification",
## gamma = 1, kernel = "polynomial", coef0 = 1, degree = 8)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 8
## coef.0: 1
##
## Number of Support Vectors: 826
##
## ( 175 325 326 )
##
##
## Number of Classes: 3
##
## Levels:
## 1 2 3
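As a rough illustration of how models like the three summarised above might be fitted, the sketch below uses e1071's `svm()` with the hyperparameters shown in the Call lines. It is not the original code: the built-in `iris` data stands in for the housing data (with `Species` playing the role of tipoDeCasa), and the 70/30 split, the seed, and the object names are assumptions.

```r
library(e1071)

# iris stands in for the housing data; Species plays the role of tipoDeCasa.
set.seed(123)
idx   <- sample(nrow(iris), round(0.7 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Drop factor columns with fewer than two observed levels (none in iris,
# shown only because the text lists this as a requirement).
one_level <- sapply(train, function(x) is.factor(x) && nlevels(droplevels(x)) < 2)
train <- train[, !one_level, drop = FALSE]

# One model per kernel family, using the parameters printed in the Call lines.
m_linear <- svm(Species ~ ., data = train, type = "C-classification",
                kernel = "linear", cost = 2^5)
m_radial <- svm(Species ~ ., data = train, type = "C-classification",
                kernel = "radial", gamma = 0.005)
m_poly   <- svm(Species ~ ., data = train, type = "C-classification",
                kernel = "polynomial", gamma = 1, coef0 = 1, degree = 8)

# Share of correct predictions on the held-out rows for each model.
models   <- list(linear = m_linear, radial = m_radial, polynomial = m_poly)
accuracy <- sapply(models, function(m) {
  mean(as.character(predict(m, test)) == as.character(test$Species))
})
accuracy
```

Parameters that are not set explicitly keep the `svm()` defaults, which is why the radial and polynomial summaries above report a cost of 1.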
For this part, the two factors being compared, in this case the prediction and the test labels, must have the same levels, so there can be no mismatch in the format of either factor. The results of the 9 models are shown below.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 128 29 1
## 2 24 104 22
## 3 1 33 89
##
## Overall Statistics
##
## Accuracy : 0.7448
## 95% CI : (0.7009, 0.7853)
## No Information Rate : 0.3852
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6143
##
## Mcnemar's Test P-Value : 0.4451
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8366 0.6265 0.7946
## Specificity 0.8921 0.8264 0.8934
## Pos Pred Value 0.8101 0.6933 0.7236
## Neg Pred Value 0.9084 0.7794 0.9253
## Prevalence 0.3550 0.3852 0.2599
## Detection Rate 0.2970 0.2413 0.2065
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.8643 0.7265 0.8440
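To make the summary statistics less opaque, the sketch below recomputes accuracy, Cohen's kappa, and per-class sensitivity from the counts of the confusion matrix just shown, using base R only. The counts are copied from the report; the variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows = Prediction, columns = Reference).
cm <- matrix(c(128,  29,  1,
                24, 104, 22,
                 1,  33, 89),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("1", "2", "3"),
                             Reference  = c("1", "2", "3")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # observed agreement

# Agreement expected by chance, from the row and column totals.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions divided by the Reference column total.
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
# accuracy 0.7448, kappa 0.6143
round(sensitivity, 4)
# 0.8366 0.6265 0.7946
```

These values match the Accuracy, Kappa, and Sensitivity lines reported for this matrix.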
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 140 18 0
## 2 22 106 22
## 3 1 24 98
##
## Overall Statistics
##
## Accuracy : 0.7981
## 95% CI : (0.7571, 0.835)
## No Information Rate : 0.3782
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6953
##
## Mcnemar's Test P-Value : 0.6853
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8589 0.7162 0.8167
## Specificity 0.9328 0.8445 0.9196
## Pos Pred Value 0.8861 0.7067 0.7967
## Neg Pred Value 0.9158 0.8505 0.9286
## Prevalence 0.3782 0.3434 0.2784
## Detection Rate 0.3248 0.2459 0.2274
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.8959 0.7804 0.8681
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 147 11 0
## 2 19 115 16
## 3 0 23 100
##
## Overall Statistics
##
## Accuracy : 0.8399
## 95% CI : (0.8018, 0.8733)
## No Information Rate : 0.3852
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7581
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8855 0.7718 0.8621
## Specificity 0.9585 0.8759 0.9270
## Pos Pred Value 0.9304 0.7667 0.8130
## Neg Pred Value 0.9304 0.8790 0.9481
## Prevalence 0.3852 0.3457 0.2691
## Detection Rate 0.3411 0.2668 0.2320
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.9220 0.8238 0.8945
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 149 9 0
## 2 23 117 10
## 3 1 32 90
##
## Overall Statistics
##
## Accuracy : 0.826
## 95% CI : (0.7868, 0.8606)
## No Information Rate : 0.4014
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.736
##
## Mcnemar's Test P-Value : 0.0003231
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8613 0.7405 0.9000
## Specificity 0.9651 0.8791 0.9003
## Pos Pred Value 0.9430 0.7800 0.7317
## Neg Pred Value 0.9121 0.8541 0.9675
## Prevalence 0.4014 0.3666 0.2320
## Detection Rate 0.3457 0.2715 0.2088
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.9132 0.8098 0.9002
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 143 15 0
## 2 23 119 8
## 3 2 37 84
##
## Overall Statistics
##
## Accuracy : 0.8028
## 95% CI : (0.762, 0.8393)
## No Information Rate : 0.3968
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7003
##
## Mcnemar's Test P-Value : 5.455e-05
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8512 0.6959 0.9130
## Specificity 0.9430 0.8808 0.8850
## Pos Pred Value 0.9051 0.7933 0.6829
## Neg Pred Value 0.9084 0.8149 0.9740
## Prevalence 0.3898 0.3968 0.2135
## Detection Rate 0.3318 0.2761 0.1949
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.8971 0.7883 0.8990
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 145 13 0
## 2 19 122 9
## 3 0 31 92
##
## Overall Statistics
##
## Accuracy : 0.8329
## 95% CI : (0.7943, 0.8669)
## No Information Rate : 0.3852
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7467
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8841 0.7349 0.9109
## Specificity 0.9513 0.8943 0.9061
## Pos Pred Value 0.9177 0.8133 0.7480
## Neg Pred Value 0.9304 0.8434 0.9708
## Prevalence 0.3805 0.3852 0.2343
## Detection Rate 0.3364 0.2831 0.2135
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.9177 0.8146 0.9085
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 145 13 0
## 2 23 120 7
## 3 1 46 76
##
## Overall Statistics
##
## Accuracy : 0.7912
## 95% CI : (0.7497, 0.8286)
## No Information Rate : 0.4153
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.682
##
## Mcnemar's Test P-Value : 4.154e-07
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8580 0.6704 0.9157
## Specificity 0.9504 0.8810 0.8649
## Pos Pred Value 0.9177 0.8000 0.6179
## Neg Pred Value 0.9121 0.7900 0.9773
## Prevalence 0.3921 0.4153 0.1926
## Detection Rate 0.3364 0.2784 0.1763
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.9042 0.7757 0.8903
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 143 15 0
## 2 23 108 19
## 3 0 23 100
##
## Overall Statistics
##
## Accuracy : 0.8144
## 95% CI : (0.7744, 0.85)
## No Information Rate : 0.3852
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7197
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8614 0.7397 0.8403
## Specificity 0.9434 0.8526 0.9263
## Pos Pred Value 0.9051 0.7200 0.8130
## Neg Pred Value 0.9158 0.8648 0.9383
## Prevalence 0.3852 0.3387 0.2761
## Detection Rate 0.3318 0.2506 0.2320
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.9024 0.7962 0.8833
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 140 18 0
## 2 21 111 18
## 3 0 20 103
##
## Overall Statistics
##
## Accuracy : 0.8213
## 95% CI : (0.7819, 0.8564)
## No Information Rate : 0.3735
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7304
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8696 0.7450 0.8512
## Specificity 0.9333 0.8617 0.9355
## Pos Pred Value 0.8861 0.7400 0.8374
## Neg Pred Value 0.9231 0.8648 0.9416
## Prevalence 0.3735 0.3457 0.2807
## Detection Rate 0.3248 0.2575 0.2390
## Detection Prevalence 0.3666 0.3480 0.2854
## Balanced Accuracy 0.9014 0.8033 0.8934
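The requirement that predictions and test labels share the same factor levels can be handled as in the sketch below, which uses made-up vectors named `truth` and `pred_raw`. It relies on base R's `table()`; the printed reports above look like the output of caret's `confusionMatrix(data, reference)`, but that is an inference from their format rather than something the report states.

```r
# Made-up labels: class 3 exists in the test set but is never predicted.
truth    <- factor(c(1, 2, 3, 2, 1, 3), levels = c(1, 2, 3))
pred_raw <- c("1", "2", "2", "2", "1", "2")

# Build the prediction factor with the levels of the test labels, so that
# classes missing from the predictions still appear as an empty row.
pred <- factor(pred_raw, levels = levels(truth))
stopifnot(identical(levels(pred), levels(truth)))

# Rows are predictions and columns are the true labels.
cm <- table(Prediction = pred, Reference = truth)
cm

accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

One detail worth checking in the nine matrices above: the row totals (158, 150, 123) are identical in every one of them, which suggests the rows may actually hold the test labels rather than the predictions. If the arguments were passed in the order (test, prediction), accuracy and kappa are unaffected, but the per-class Sensitivity and Pos Pred Value rows would be swapped.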